ABSTRACT

Identification of genomic regulatory elements is 
essential for understanding the dynamics of cellular 
processes. This task has been substantially facilitated 
by the availability of genome sequences for many 
species and high-throughput data of transcripts and 
transcription factor (TF) binding. However, rigorous 
computational methods are necessary to derive 
accurate genome-wide annotations of regulatory 
sites from such data. SwissRegulon (http://swissregulon.unibas.ch)
is a database containing
genome-wide annotations of regulatory motifs, pro- 
moters and TF binding sites (TFBSs) in promoter 
regions across model organisms. Its binding site pre- 
dictions were obtained with rigorous Bayesian prob- 
abilistic methods that operate on orthologous regions 
from related genomes, and use explicit evolutionary 
models to assess the evidence of purifying selection 
on each site. New in the current version of 
SwissRegulon is a curated collection of 190 mammalian regulatory
motifs associated with ~340 TFs, and TFBS annotations across a
curated set of ~35 000 promoters in both human and mouse.
Predictions of TFBSs for Saccharomyces cerevisiae have also been
significantly extended and now cover 158 of yeast's ~180 TFs. All
data are accessible both through an easily navigable genome browser
with search functions and as flat files that can be downloaded for
further analysis.

INTRODUCTION 

The study of gene regulatory networks is a central area of systems
biology, and one of the crucial steps in the reconstruction of gene
regulatory networks is the identification of functional regulatory
sites in cis-regulatory regions genome wide. During the past decades,
a combination of developments in experimental and computational
methodologies has dramatically improved our ability to identify the
binding sites of transcription factors (TFs) on a genome-wide scale.
On the experimental side, the development of technologies such as
chromatin immunoprecipitation followed by micro-array hybridization
or next-generation sequencing (ChIP-chip and ChIP-seq) [e.g. (1)] has
made it possible to comprehensively identify short genomic regions at
which particular TFs are bound in a given experimental condition. In
parallel, protein array technology (2) can be used in high throughput
to map the binding specificities of TFs in vitro. At least as
important as these experimental developments has been the development
of computational methodologies for the inference of TF binding
specificities and the mapping of TF binding sites (TFBSs). The most
advanced current methodologies typically make use of rigorous
Bayesian probabilistic methods to analyze high-throughput biological
data, and incorporate comparative genomic analysis to assess the
functionality of putative sites, often involving explicit models of
sequence evolution and the effects of natural selection; see (3) for
a review.

Our group has been involved for more than a decade in 
the development of computational methodologies for the 
inference of TF binding specificities and annotation of 
functional regulatory sites genome wide (3-11). Using 
these methods, we have been curating a number of 
resources in our SwissRegulon online database (12), 
including collections of position-specific weight matrices 
(WMs), promoters and predictions of TFBSs across 
proximal promoter regions genome wide in a number of 
model organisms. At the time of our original report on the 
SwissRegulon database, the database contained TFBS 
predictions for yeast and 17 prokaryotic organisms. As 
we will detail below, in the intervening 5 years, the 
database has been significantly extended in several ways. 
Most importantly, SwissRegulon now contains annotations of functional
TFBSs for 190 regulatory motifs, representing the binding
specificities of ~340 TFs, across proximal promoters genome wide in
both human and mouse. In addition, the TFBS annotations for
Saccharomyces cerevisiae have been significantly extended, and now
include predictions for 158 of yeast's ~180 TFs (13), making it the
most comprehensive annotation of genome-wide binding sites available
for any organism. We have also completely overhauled the user
interface, implementing a new version of the genome browser, and
adding a number of search functionalities that significantly improve
user friendliness. Besides allowing users to browse annotations for
promoters, genes or regulators of interest, we also make all data
available for download in flat file format, including promoter
annotations, curated collections of WMs and all TFBS predictions.
Finally, the SwissRegulon web site also provides access to various
tools that allow users to analyze their own data, including source
code of software used in the TFBS predictions (5,11), and online
tools.

RESULTS 

TFBS annotations in 17 prokaryotic genomes 

The TFBS annotations for prokaryotic genomes have 
remained largely unchanged compared with the previously 
reported version of SwissRegulon. With the exception of 
Escherichia coli, the predictions in the 16 other prokary- 
otic genomes are based on an algorithm for identifying 
sequence segments in intergenic regions that are under 
purifying selection (IRUS) described previously (6). For 
E. coli, we curated a set of WMs using our algorithm for 
probabilistic clustering of sequences (PROCSE) (4) as
described previously (12), and used MotEvo to predict 
TFBSs in intergenic regions genome wide. 

S. cerevisiae TFBS annotation 

Using a combination of data from ChIP-chip and in vitro binding
assays, we have curated a collection of WMs representing the binding
specificities of 158 TFs from S. cerevisiae (14). We constructed
multiple alignments of all intergenic regions in S. cerevisiae, i.e.
all sequence segments between annotated coding regions, with the
orthologous regions in Saccharomyces paradoxus, Saccharomyces
mikatae, Saccharomyces kudriavzevii and Saccharomyces bayanus. Using
our recently updated MotEvo algorithm (11), we then predicted
functional regulatory sites for these 158 TFs, yielding >400 000
sites. Notably, using this annotation we uncovered several striking
features of the 'grammar' of yeast's regulatory code (10), and we
have also used this annotation to investigate the effects of
cis-regulatory polymorphisms on gene expression (14). Of particular
interest, in that study, we also provided evidence that the predicted
regulatory interactions between TFs and target promoters, based on
our TFBS predictions, are up to an order of magnitude more accurate
than those obtained directly from ChIP-chip experiments.

In recent work (E. A. Ozonov and E. van Nimwegen, 
submitted for publication), we have been using rigorous 
biophysical models to analyze the competition between 
TFs and nucleosomes for binding to DNA, and have 
investigated to what extent observed nucleosome positioning in yeast
can be explained by the competitive binding of TFs. The results of
these investigations include predicted nucleosome occupancies, as
well as occupancies of individual TFs in YPD genome wide, i.e.
including across coding
regions. These genome-wide TF occupancy profiles, as well 
as experimentally measured nucleosome occupancies (15), 
are also made available through SwissRegulon. 

Human and mouse data 
Promoterome 

Our genome-wide predictions of TFBSs in human and mouse originated in
our analysis work in the context of the FANTOM4 project (7,16,17). As
part of this project, deepCAGE sequencing of transcription start
sites (TSSs) across different tissues in human and mouse was obtained
in high throughput. We developed several novel procedures to analyze
these TSS data and obtained hierarchical mammalian 'promoteromes'
consisting of individual TSSs and transcription start clusters (TSCs)
of nearby TSSs that are co-expressed across different conditions (8).
We have further extended this set of human and mouse promoters to
include promoters of transcripts that are not expressed in the cell
types sampled by the deepCAGE data. In particular, we included 5'
ends of messenger RNAs from the University of California at Santa
Cruz (UCSC) database (18), which were mapped to the human and mouse
genomes using the BLAST-like alignment tool (BLAT) (18). To avoid
transcripts whose 5' ends are misaligned, we filtered out those for
which >25 bp at the 5' end of the transcript were unaligned. We then
integrated the TSCs based on the deepCAGE data with these 5' ends
using the following iterative clustering procedure: at each step the
nearest pair of clusters is fused, with the constraints that there
can be at most one TSC per cluster (because different TSCs by
construction are not co-expressed), and that the distance between
merged clusters cannot be >150 bp (i.e. we use a distance cut-off
roughly corresponding to the amount of DNA wrapped around a single
nucleosome). The resulting reference set of promoters contains
~36 000 promoters in human and ~34 000 promoters in mouse. The
promoteromes are available both in flat file format and through the
genome browser.
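
The filtering and clustering steps described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation. It assumes that all features lie on one strand of one chromosome, that each feature is reduced to a single coordinate, and that the distance between two clusters is the smallest distance between any of their members; the text does not specify the distance definition. The names `Cluster`, `keep_transcript` and `cluster_promoters` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_UNALIGNED_5P = 25   # bp; transcripts with more unaligned 5' bases are dropped
MAX_MERGE_DIST = 150    # bp; distance cut-off quoted in the text


@dataclass
class Cluster:
    positions: List[int]  # coordinates of the members (TSC or mRNA 5' ends)
    n_tsc: int            # number of TSCs in the cluster (must stay <= 1)


def keep_transcript(unaligned_5p_bp: int) -> bool:
    """Keep a transcript unless >25 bp at its 5' end are unaligned."""
    return unaligned_5p_bp <= MAX_UNALIGNED_5P


def cluster_distance(a: Cluster, b: Cluster) -> int:
    """ASSUMPTION: single-linkage distance (closest pair of members)."""
    return min(abs(p - q) for p in a.positions for q in b.positions)


def cluster_promoters(features: List[Tuple[int, bool]]) -> List[Cluster]:
    """features: (position, is_tsc) pairs on one chromosome strand."""
    clusters = [Cluster([pos], int(is_tsc)) for pos, is_tsc in features]
    while True:
        best = None  # (distance, i, j) of the nearest admissible pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if a.n_tsc + b.n_tsc > 1:
                    continue  # at most one TSC per cluster
                d = cluster_distance(a, b)
                if d > MAX_MERGE_DIST:
                    continue  # clusters too far apart to be fused
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            break  # no admissible pair left
        _, i, j = best
        merged = Cluster(sorted(clusters[i].positions + clusters[j].positions),
                         clusters[i].n_tsc + clusters[j].n_tsc)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters, key=lambda c: c.positions[0])
```

The quadratic pair search is chosen for readability only; on sorted one-dimensional coordinates, it would be enough to compare neighboring clusters.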

Regulatory motifs 

Using a combination of data from the JASPAR (19) and TRANSFAC (20)
databases, other motifs from the literature, and motifs obtained
using our own analysis of ChIP-chip and ChIP-seq data, we curated a
data set of 190 position-specific WMs that represent the binding
specificities of ~340 TFs in both human and mouse. The curation
included reducing redundancy by fusing WMs that are similar, have
predominantly overlapping binding sites or are associated with TFs
sharing highly similar DNA-binding domains. WMs were also refined by
iteratively performing TFBS predictions in all proximal promoters
genome wide (as described later). A detailed description of the
curation procedure was provided in the supplementary materials of
(7), and an updated version is provided in the documentation section
of the web site. Both the WM collections and the associated mapping
to human and mouse TFs are available in flat file format.

TFBS predictions in proximal promoters 

For every promoter in the collection, we extracted 1 kb of 
DNA sequence from the genome, centered on the most 
highly expressed TSS of the promoter, and extracted 
orthologous sequences from human, mouse, rhesus 
macaque, dog, cow, horse and opossum, using pairwise 
genome mappings from the UCSC database (18). The 
sets of orthologous sequences were then multiply aligned
with T-Coffee (21), and TFBSs were predicted on these 
multiple alignments. 
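
As a small illustration of the first step, the sketch below computes a 1 kb window centered on a TSS and extracts the corresponding sequence. It is an assumption-laden example, not the pipeline used for SwissRegulon: coordinates are taken to be 0-based and half-open, the window is clipped at chromosome ends, minus-strand promoters are reverse complemented, and the function names are hypothetical.

```python
# Translation table for complementing DNA (upper and lower case, N kept as N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def promoter_window(tss, chrom_length, size=1000):
    """Return (start, end) of a window of `size` bp centered on `tss`.

    ASSUMPTION: 0-based, half-open coordinates; the window is clipped
    so that it never extends past the ends of the chromosome.
    """
    half = size // 2
    start = max(0, tss - half)
    end = min(chrom_length, tss + half)
    return start, end


def promoter_sequence(chrom_seq, tss, strand, size=1000):
    """Extract the window around `tss`, oriented along the promoter strand."""
    start, end = promoter_window(tss, len(chrom_seq), size)
    seq = chrom_seq[start:end]
    if strand == "-":
        # Reverse complement so the sequence reads 5'->3' on the minus strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq
```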

To predict TFBSs for each of the 190 mammalian regulatory
motifs, we used our MotEvo algorithm (11). As
described in (8), promoters in human and mouse naturally 
fall into two categories according to their sequence com- 
position: promoters associated with CpG islands, and 
those not associated with CpG islands. We separated the 
multiple alignments associated with each class of pro- 
moters and performed TFBS predictions separately for 
each class. We have observed that different TFs have 
clearly distinct preferences in the positioning of their 
sites with respect to the TSS, and our TFBS predictions 
take these preferences explicitly into account. For each 
WM and each promoter class, we initialize a position-dependent prior
distribution with a uniform distribution and perform an initial round
of TFBS predictions using MotEvo. Using expectation maximization, we
then iteratively update the position-dependent prior distribution of
site frequencies and the TFBS predictions, until the
position-dependent prior converges. Figure 1
(right panel) illustrates the inferred position-dependent 
profile for the motif NHLH1,2. Note that sites for this 
motif are more abundant in high-CpG promoters than 
in low-CpG promoters, and that, especially in high-CpG 
promoters, the sites preferentially occur immediately 
downstream of the TSS. 
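
The iterative update of the position-dependent prior can be illustrated with a deliberately simplified model. The sketch below is not MotEvo: it assumes that, for every promoter and every position, a likelihood ratio (site model versus background) has already been computed, and it treats positions as independent two-component mixtures. The function names and the starting value `pi0` are hypothetical.

```python
def site_posteriors(lr, pi):
    """E-step: posterior site probability at each promoter and position.

    lr[p][x] is the likelihood ratio P(seq | site) / P(seq | background)
    at position x of promoter p; pi[x] is the prior site probability at x.
    """
    return [[p * r / (p * r + (1.0 - p)) for p, r in zip(pi, row)]
            for row in lr]


def em_position_prior(lr, pi0=0.01, tol=1e-10, max_iter=1000):
    """Fit a position-dependent prior by expectation maximization.

    Starts from a uniform prior (pi0 at every position) and alternates
    between computing posteriors and re-estimating the prior, until the
    prior changes by less than `tol`.
    """
    n_prom, n_pos = len(lr), len(lr[0])
    pi = [pi0] * n_pos
    for _ in range(max_iter):
        post = site_posteriors(lr, pi)
        # M-step: the new prior at x is the mean posterior over promoters.
        new_pi = [sum(row[x] for row in post) / n_prom for x in range(n_pos)]
        delta = max(abs(a - b) for a, b in zip(new_pi, pi))
        pi = new_pi
        if delta < tol:
            break
    return pi, site_posteriors(lr, pi)
```

In this toy setting, positions where the likelihood ratios carry no information keep their initial prior, whereas positions with consistently strong matches across promoters acquire a high prior, which in turn raises the posterior of individual sites there.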

In the final predictions, each TFBS in each promoter is characterized
by a posterior probability that rigorously incorporates the quality
of its match to the WM, the evidence for purifying selection on this
binding site from the multiple alignments and its position relative
to the TSS. For human (UCSC assembly hg19), MotEvo reported
>1 320 000 sites in ~36 000 promoters. For mouse (UCSC assembly mm9),
MotEvo reported >1 180 000 sites in ~34 000 promoters. We also
provide regulatory site annotations for an older human assembly (UCSC
assembly hg18), which, currently, is still used by a significant
number of researchers. Clicking on a predicted site in the genome
browser leads to a separate page with detailed information on the
site, as shown in Figure 1. This allows users to, among other things,
investigate the precise conservation of the site across mammals.

Figure 1. Information provided for each predicted site. Left panel: listed are the name of the regulatory motif (NHLH1,2), the identification number
of the site (NHLH1,2-1252160), the source of the prediction (MotEvo), the length of the site (12 bp), its posterior probability (0.971), the sequence of
the site and the orthologous sites from other organisms aligned to it. In this case, orthologous sites of this site occur in mouse (mm9), dog
(canFam2), cow (bosTau6) and opossum (monDom5). A sequence logo of the WM is also shown. Right panel: position-dependent TFBS density
for this motif. The figure shows the probability to find a TFBS for the NHLH1,2 motif as a function of position relative to TSS in both high-CpG
(green) and low-CpG (red) promoters. Listed are also the TFs associated with the motif (NHLH1 and NHLH2), the promoters that are putatively
driven by this site and finally the sequence of the site.

MotEvo prediction algorithm

The binding site predictions in SwissRegulon are made using our
MotEvo software, which is an integrated suite of Bayesian
probabilistic methods for the prediction of TFBSs and inference of
regulatory motifs from multiple alignments of phylogenetically
related DNA sequences (11). Similar to a few other methods for TFBS
prediction (22,23), MotEvo uses an explicit evolutionary model for
the evolution of sequence segments that are under purifying selection
to maintain their affinity for a given TF. However, in contrast to
these methods, MotEvo robustly deals with binding sites that are only
conserved in a subset of the species, or that appear missing because
of local errors in the multiple alignments, which strongly improves
performance (11). In addition, MotEvo takes into account that there
may be many segments in the multiple alignments that are under strong
purifying selection, but not for any of the known regulatory motifs,
and models this by rigorously integrating over the space of possible
WMs. This 'unknown functional element' model is also used to predict
putative regulatory sites for genomes for which no regulatory motifs
are available. Finally, MotEvo can also be used to perform enhancer
finding by searching for clusters of binding sites for a given subset
of WMs, similar to the functionality provided by the Stubb algorithm
(24), but generalized to arbitrary multiple alignments. The MotEvo
software package is available for download from the SwissRegulon web
site.

Quick search functionality 

We provide a convenient way of searching through our collection of
mammalian promoters and motifs from the main page of the SwissRegulon
web site using a case-insensitive search by keyword. The keyword may
be a gene name, a transcript identifier, an Entrez Gene identifier, a
motif identifier or a general keyword that occurs in the description
of the feature of interest, e.g. 'transferase' or 'regulator of
chromatin'. The user may also use partial keywords like 'PAX' to find
matches for all PAX TFs.

The search results are presented in a hierarchical form, initially
listing all organisms for which matches were found, and the number of
matches is shown in each of the following categories: matches to
motifs, matches to genes or transcripts associated with a promoter or
matches within the description of genes associated with a promoter.
Clicking on one of the categories of matches expands the list showing
summary information for all matches. This summary information is in
turn clickable and takes the user to pages with detailed information
on the match. Matches to promoters are linked to the corresponding
page in the genome browser, showing all predicted regulatory sites in
the neighborhood of that promoter. Matches to gene names, transcript
accessions and gene description are also linked to the corresponding
NCBI pages.
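
The behavior of such a search, case-insensitive matching of partial keywords with hits grouped by organism and category, can be sketched as below. This is only an illustration under assumed data structures; the record fields (`id`, `organism`, `motif`, `gene_or_transcript`, `description`) and the function name are hypothetical and do not reflect the actual SwissRegulon database schema.

```python
from collections import defaultdict

# Categories mirror the three kinds of matches described in the text.
CATEGORIES = ("motif", "gene_or_transcript", "description")


def quick_search(records, keyword):
    """Group case-insensitive substring matches by organism and category.

    records: iterable of dicts with an 'id', an 'organism' and, optionally,
    one text field per category (hypothetical layout).
    Returns {organism: {category: [record ids]}}.
    """
    needle = keyword.casefold()
    hits = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for category in CATEGORIES:
            text = rec.get(category, "")
            if text and needle in text.casefold():
                hits[rec["organism"]][category].append(rec["id"])
    return {org: dict(cats) for org, cats in hits.items()}
```

For example, with records describing a PAX6 motif and a promoter of the PAX2 gene, the partial keyword 'pax' would return both, each listed under its own category.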

Motifs are linked to pages with extensive information 
on the motif including a sequence logo, the corresponding 
WM, a figure containing its position-dependent site 
frequencies in high-CpG and low-CpG promoters and a 
sorted table listing all promoters that have predicted
TFBSs for the motif. 



Genome browser interface 

In the current version of SwissRegulon, we use the updated version of
the Generic Genome Browser (25) (version 2.45) as an interactive
front end to the database. The new version of GBrowse operates much
faster than previous versions and allows much easier navigation
across the genome. The page layout is similar to the previous version
but includes a few changes. There is a toolbar at the top of the page
giving access to various operations like exporting current track data
in different formats, sharing data and getting help on GBrowse.
Underneath this is a tab bar giving access to different panels: the
browser itself, a panel for selecting which tracks to display, a
snapshot panel for managing bookmarked regions, a panel that allows
users to upload their own tracks and a panel for setting preferences.

Below the tab bar is a short 'Landmark or region' form, 
where users can either type a search term (e.g. a gene 
name) or explicitly specify a pair of genomic coordinates. 
The genomic region of interest is shown hierarchically in 
three panels: 'Overview', 'Region' and 'Details'. The 
'Details' panel shows the chosen tracks in the genomic 
region of interest. Each track can be easily turned on/off 
or customized by clicking on it. All displayed features are 
clickable and link to more detailed information about the 
feature. 

GBrowse offers a number of convenient ways for quick navigation.
There are self-explanatory 'Scroll/Zoom' buttons and a drop-down menu
for zooming to predefined resolutions. In addition, the 'Overview'
and 'Region' panels have rulers that allow the user to select a
region to zoom and jump to the selected region. Selecting a region in
the 'Details' panel brings up a menu with a few available operations.
Clicking on a ruler in any panel re-centers the view on the selected
location. Placing the mouse pointer on any of the features in the
'Details' panel brings up more detailed information about the
feature, which may include a preview of the page that clicking on the
feature links to. Figure 2 shows an example browser window that
illustrates several of these features.

Data representation within the genome browser 

Annotated regulatory sites are displayed as boxes with an
arrow inside (Figure 2) that indicates the strand of the
site, and are labeled with the name of the TF(s) recognizing
the site (when known). The color of the box indicates the 
site's posterior probability, i.e. a more intense color indi- 
cates higher probability. The pop-up window that appears 
when placing the mouse pointer on a site contains the 
motif name, its posterior probability, the site sequence 
and a motif logo (if available). Clicking on a regulatory 
site opens a new window with detailed information about 
the site as described earlier. 

Promoters are displayed as an arrow indicating the strand of the
promoter (Figure 2) and are labeled with a unique identifier. A
promoter's pop-up window shows lists of genes and transcripts
associated with the promoter. Clicking on the promoter opens a new
window with detailed information on the promoter including its
identifier, coordinates, strand, associated transcripts with
corresponding gene names and information about each associated gene.

Figure 2. The main browser panel showing a region around the promoter of the PARD3B gene in human. In this example, five tracks are shown:
transcripts, promoters, TSC, TSS and TFBSs. The figure demonstrates selection of a region with the mouse and the associated drop-down menu with
options. It also illustrates the pop-up windows with information about the promoter and one of the TFBSs, which will appear when placing the
mouse pointer on these features.

TSCs are displayed as arrows indicating their strand 
and a unique identifier. Their pop-up windows show the 
associated promoter and the position of the most highly 
transcribed TSS within the cluster. Finally, our annotation 
data are overlaid with transcript annotation from UCSC
(human and mouse) and GenBank annotation for other
organisms. 

Downloads 

All data are available for download. Genome annotations (promoters,
TFBSs, TSCs and TSSs) are provided in GFF format
(http://www.sequenceontology.org/resources/gff3.html). The WM
annotation includes a file with WMs in the standard TRANSFAC format,
and a plain text file containing motif-to-gene associations. The
nucleosome occupancy data are provided in wig format
(http://genome.ucsc.edu/goldenPath/help/wiggle.html). The description
of the different formats can be found in the 'Documentation' section
of the SwissRegulon web site. The 'Software' section of the web site
provides downloads for motif and binding site prediction software,
e.g. MotEvo (11) and PhyloGibbs (5). The latter also has a web
interface, which can also be accessed through the SwissRegulon web
site. Finally, there is also a link to a web-server for our
Integrated Motif Activity Response Analysis (ISMARA). ISMARA allows
users to automatically analyze their gene expression (microarray or
RNA-seq) or ChIP-seq data in terms of our genome-wide predicted
binding sites, with the aim of identifying the key TFs, their
activities and their targets, in a given system of interest.
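
For readers who want to work with the downloadable annotations, the following sketch parses a single GFF3 feature line into a dictionary. It follows the generic nine-column GFF3 layout; how SwissRegulon fills the individual columns (for example, whether the posterior probability of a site is stored in the score column) is an assumption here and should be checked against the 'Documentation' section of the web site. The function name is hypothetical.

```python
from urllib.parse import unquote

GFF3_COLUMNS = 9  # seqid, source, type, start, end, score, strand, phase, attributes


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dict.

    Returns None for comment/directive lines (starting with '#') and
    for blank lines.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != GFF3_COLUMNS:
        raise ValueError(f"expected {GFF3_COLUMNS} columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for field in attrs.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters such as ',' and ';'.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based, inclusive
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }
```

A typical use would be to read a downloaded file line by line, skip the `None` results and keep only features whose score exceeds a chosen threshold, again assuming that the score column holds the quantity of interest.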



FUTURE DEVELOPMENTS

For the coming years, the key updates and extensions that 
we intend to implement are the following. First, an im- 
portant model organism that is currently missing from 
SwissRegulon is the fruit fly Drosophila melanogaster.
Our curation of fly regulatory motifs and genome-wide
predictions are already in an advanced stage of comple- 
tion, and we expect to be able to offer genome-wide TFBS 
annotations for D. melanogaster in the near future. We are 
also in the course of updating our regulatory site predic- 
tions for E. coli, including a newly curated set of WMs, 
and expect to be able to provide these fairly soon. 

A key limitation of SwissRegulon's TFBS annotations is that, in
multicellular eukaryotes, the predictions are limited to promoter
regions. Although these regions likely contain a significant fraction
of relevant regulatory sites, it is well known that many important
regulatory sites are contained in distal cis-regulatory modules (or
enhancers) (26). Recent developments in high-throughput mapping and
analysis of chromatin state along the genome have uncovered that
distal regulatory regions can be recognized by their DNase I
sensitivity (27), methylation status (28) and particular combinations
of histone modifications (29), allowing a more systematic mapping of
distal cis-regulatory modules. Based on such information, we are
currently curating a number of sets of distal regulatory regions and
expect to be able to provide TFBS predictions for these sets in the
near future.

SwissRegulon currently provides an overview page for 
each regulatory motif that, in particular, provides a sorted 
list of all promoters/genes targeted by the motif. We 
intend to develop similar pages for each individual 
promoter/gene. These pages will thus contain an easy 
overview of all 'regulatory inputs' that are predicted for
a given promoter or gene of interest. 

Another crucial factor limiting the completeness of genome-wide TFBS
predictions is the fact that, for many TFs, the sequence specificity
is unknown. However, with dramatically decreasing sequencing costs
and more easily accessible protocols for ChIP-seq analysis, the
number of available ChIP-seq data sets is increasing rapidly. We have
developed an automatic pipeline for processing ChIP-seq data,
identifying high-quality binding peaks, and using motif inference
programs such as PhyloGibbs to infer regulatory motifs from such data
sets. In the near future, we intend to use this automated pipeline to
significantly expand the number of TFs for which regulatory motifs
are available.

Finally, our new search function has proven itself a useful tool for
quick access to the information, but it currently only contains
information from the annotations of human and mouse; we intend to
extend it in the near future to include all eukaryotic and
prokaryotic species that are in the database.